Exact Weighted Minwise Hashing in Constant Time

نویسنده

  • Anshumali Shrivastava
چکیده

Weighted minwise hashing (WMH) is one of the fundamental subroutine, required by many celebrated approximation algorithms, commonly adopted in industrial practice for large scale-search and learning. The resource bottleneck of the algorithms is the computation of multiple (typically a few hundreds to thousands) independent hashes of the data. The fastest hashing algorithm is by Ioffe (Ioffe, 2010), which requires one pass over the entire data vector, O(d) (d is the number of non-zeros), for computing one hash. However, the requirement of multiple hashes demands hundreds or thousands passes over the data. This is very costly for modern massive dataset. In this work, we break this expensive barrier and show an expected constant amortized time algorithm which computes k independent and unbiased WMH in time O(k) instead of O(dk) required by Ioffe’s method. Moreover, our proposal only needs a few bits (5 9 bits) of storage per hash value compared to around 64 bits required by the stateof-art-methodologies. Experimental evaluations, on real datasets, show that for computing 500 WMH, our proposal can be 60000x faster than the Ioffe’s method without losing any accuracy. Our method is also around 100x faster than approximate heuristics capitalizing on the efficient “densified” one permutation hashing schemes (Shrivastava & Li, 2014a). Given the simplicity of our approach and its significant advantages, we hope that it will replace existing implementations in practice.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Simple and Efficient Weighted Minwise Hashing

Weighted minwise hashing (WMH) is one of the fundamental subroutine, required by many celebrated approximation algorithms, commonly adopted in industrial practice for large -scale search and learning. The resource bottleneck with WMH is the computation of multiple (typically a few hundreds to thousands) independent hashes of the data. We propose a simple rejection type sampling scheme based on ...

متن کامل

BagMinHash - Minwise Hashing Algorithm for Weighted Sets

Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very ecient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than current state of the art without any particular rest...

متن کامل

Optimal Densification for Fast and Accurate Minwise Hashing

Minwise hashing is a fundamental and one of the most successful hashing algorithm in the literature. Recent advances based on the idea of densification (Shrivastava & Li, 2014a;c) have shown that it is possible to compute k minwise hashes, of a vector with d nonzeros, in mere (d + k) computations, a significant improvement over the classical O(dk). These advances have led to an algorithmic impr...

متن کامل

One Permutation Hashing

Abstract Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time nearneighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 per...

متن کامل

b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions

ABSTRACT Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [27] demonstrated a potential use of b-bit minwise hashing [26] for batch learning on large data. However, several critical issues must be tackled before one can apply b-bit minwise hashing to the volumes of data often used industrial applications, especially in the cont...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1602.08393  شماره 

صفحات  -

تاریخ انتشار 2016